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Abstract 

We consider Benham's model for strand separation in negatively 
supercoiled circular DNA, and study denaturation as function of the 
super helical density k < 0. We propose a statistical version of this 
model, based on bayesian segmentation methods of current use in 
bioinformatics; this leads to new algorithms with priors adapted to 
supercoiled DNA, taking into account the random nature of the free 
energies needed to denature hydrogen bonds. 
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1 Introduction 



Initiation of transcription in DNA requires the two strands of the double 
helix to separate, and strand separation is enhanced in negative supercoiled 
DNA. Benham(1979, 1990, 1996) proposes a mathematical model for the pro- 
cess of strand separation, based on statistical mechanics ideas, and develop 
algorithms to locate interesting sites along the DNA where strand separa- 
tion or replication is strongly favored (see e.g. Clote and Backhofen(2000)). 
These computational methods are Metropolis dynamics (Sun et altri(1995)) 
or exact methods relying on transfer matrices(Fye and Benham(1999)). In 
a typical state, some hydrogen bonds are broken; we are interested in the 
repartition of the droplets of denatured bonds, and on the nature of the bases 
situated in these domains. Our aim is to investigate the effect of the num- 
ber of droplets, of the degreee of negative super helicity of the DNA and of 
the concentration in A+T bonds on the equilibrium properties of Benham's 
model. This is the topic of Section [21 where homopolymers are treated ac- 
cording to the number of connected domains of denatured bonds: in Section 
12.1.11 we study denaturation when no restriction on the number of domains 
is imposed, and show that no robust and stable denatured state exists; this 
is the situation adapted to the algorithms of Fye and Benham(1999). In 
Section [2.1.21 we study the model when the number of domains rjy is such 
that vn/N — > 0, as N — > oo, where N denotes the number of bases of the 
DNA. We show the existence of a stable and robust denatured state when 
the duplex is sufficiently negatively supercoiled. This is the regime where the 
MCMC of Sun et altri applies. We next turn to heteropolymers in Section 
12.21 and study localized denaturation as function of the proportion of A+T 
bonds and of the level of negative superhelicity of the DNA. Section 01 fo- 
cus on the statistical aspects of the model: Section [3.11 translates Benham's 
model in a Bayesian framework, and Section 13.21 shows how Bayesian seg- 
mentation methods of current use in bioinformatics, as presented in Liu and 
Lawrence(1999), can be of interest in the strand separation problem. This 
also give new algorithms with priors adapted to supercoiled DNA, which 
take into account the random fluctuations of the free energies needed to 
denature A+T and G+C bonds. 

In what follows, we consider a spin system based on a circular graph 
with node set S, \S\ = N, N G N, and, for each site i E S, a spin cr, € 
{— 1,+1}. Benham's original model deals with a lattice gas, with binary 
random variables raj. We use both notations by setting raj = (<jj + l)/2, and 
use spins to link this model with known mean field models. The meaning 
of rii = (resp. cij = —1) is that the bases of the double helix at site 
i G S are linked by an hydrogen bond, the link is closed, and rij = 1 
(resp. Oi = +1) means that this bond is broken, the link is open. Then 
ra := ^j=i ni denotes the number of open bonds. The partitioning of the 
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DNA in domains allows the linking numbers to be regulated, where the 
linking number L of a configuration describes the way the duplex winds 
about the axis, assuming that the axis of the helix is planar. The twist 
T is the number of times the duplex revolves about its axis(see e.g. Clote 
and Backhofen(2000), chap. 6.2). When the DNA is relaxed, the so-called 
B-DNA state, a segment of N bases produces typically the characteristic 
linking number Lq = N/A, where the constant A is experimentally situated 
around 10.4. A negative supercoilded DNA is a configuration obtained from 
the B-DNA by cutting the strands using topoisomerases of type II; the 
strands then rotate around each others in the direction opposite of the twist 
of the helix, reducing then its linking number (L < Lq). This forces then 
the circular axis of the helix to wind, producing then a more twisted and 
compact configuration (see e.g. Lewin(1994)). This supercoiled state permits 
for example to put the helix in nuclei. Benham's model permits to quantify 
the way supercoiling enhances strand separation. 

Consider a supercoiled DNA with negative linking difference a = L — 
Lq < 0, imposed during the process. Assume that n links between bases 
are broken; the helix unwinds locally and thus increases its linking number 
to a + n/A. Because the strands car rotate around each others, the same 
process induced also a twist T, yielding a residual linking difference a r = 
a + n/A — T. The total twist T between separated regions is modeled as 

N 

T _ n i T j 

^ 2vr ' 

i=i 

where Tj G R is the local helicity. Benham's idea is to quantify all of these 
steps with free energy costs. The torsional free energy Gt is given by 

C N 

Gt = ^^2^, 

i=i 

for some stiffness coefficient C > 0. Consider a configuration in which n 
bonds are separated in r runs, that is in r connected components of open 
bonds. In what follows, 2r is the perimeter of the configuration. 

The free energy cost for separation is modeled as 

N 

G s = ar + '^2b i n i , a > 0, r = ^^(1 - n w ), 
i=l i 

where the parameters bj indicate the natures of the bases located at sites 
i G S: AT links are formed of 2 hydrogen bonds, and GC links consist in 3 
hydrogen bonds. If i G S is associated with an AT link, we set bi = bj^T and 
bi = bcc otherwise, with bcc > In the homopolymer case, bi = b > 0. 
This will act as an exterior magnetic field in the statistical approach; when 
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the bases are chosen at random on the DNA with some law, J2i n A can be 
viewed as a random external field. The free energy b needed to break the 
hydrogen bonds, and thus to separate a base pair, depends on the inverse 
temperature (3: 



where AH is the enthalpy of the reaction and f3 m is the inverse temperature 
associated to the melting temperature T m . Below T m , the field b is positive, 
this is the regime we are interested in. From Benham(1992), the pBR322 
DNA is such that AH AT = 7.25 kcal/mol, AH GC = 9.02 kcal/mol, and the 
melting temperature T m follows the law 



where x is some parameter and Fqc = for AT bonds and Fqc = 1 f° r GC 
bonds. When x = 0.01 and T = = 310 K, the resulting free energies 
are given by b at = 0.255 kcal/mol and bcc = 1-301 kcal/mol. 

Long range interactions appear with the fluctuations of the linking num- 
ber: it is known experimentally that the energy cost associated with the 
residual linking number for supercoiled DNA is given by 



Experimentally, the coefficient K is inverse proportional to the number of 
bases of the DNA; we thus set 



As N is large, basic statistical reasoning suggests to renormalize the variable 
n as n/N, to catch the thermodynamical limit. We thus introduce the 
superhelical density k by setting 



b = b(P) = AH(l-z£) 



(1) 



T m = 354.55 + 16.61og(x) + 41F GC 




T) 2 . 



a = kN. 



The overall free energy takes then the form 



n 1 



N N 



2NA N 



i=i i=i 



ni-arnrii+i). 



We use the local fields 26« insteed of bi for notational purpose. 
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2 Results on denaturation 



2.1 The homopolymer approximation 

In this paragraph, we suppose that b{ = b > and that r, = r G R. Set for 
convenience 

1 * 1 * 

Afjv = 77 n *' mjv = 77 zC CT « = 2M7V ~~ L 

i=l i=l 
Then the Hamiltonian of the system becomes 

G T = N(2bM N + ^M N + ^(« + (-L - ^)M W ) 2 ) + a# sing /4, 

Z Z zA Z7T 

where the index r of GV indicates the dependence on the torsion coefficient 
r and i/ging denotes the Hamiltonian associated with the nearest neighbor 
ferromagnetic Ising model in dimension 1 

N-i 



H sillg = - y~] o-jCJj+i =4r - N. 



i=i 

Before introducing the Gibbs measure of the system at inverse temperature 
[3 > 0, let us proceed as in Benham by averaging the system with respect to 
the torsion coefficient r. The Boltzman weight should be exp(— (3G T ). Av- 
eraging over t G R gives the integral j" R exp(— (3G T )dT, that is the effective 
Hamiltonian 

gl - NKM N + >,„ 8 + n J^^ + , (2) 

(see Fye and Benham(1999)). 



2.1.1 Arbitrary large perimeter 

In this approximation, we shall consider the behavior of Mn in the thermo- 
dynamical limit TV" — > oo under Gibbs measure 

7r^ jB ((7) = exp(-PHi(<r))/Z N (p,B), (3) 

where Zn((3,B) denotes the related partition function, and where -B = 
(6j)l<i<JV) &« = &j denotes the exterior field. We use mainly Laplace method 
bu using the large deviation rate function associated with the ferromagnetic 
nearest neighbor Ising model in dimension one. Notice the appearance of the 
magnetization M in the denominator of (J2J), which is quite unconventional. 
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Let 7Tp a be the Gibbs measure at inverse temperature (3 a = a/3/4 associated 
with the Hamiltonian -ff s ing- Then, for any function h : I := [0, 1] — ► R, 

<h(M N )eM-N(3F B (M N )) >7rfe 
< h(M N ) >^ B - <eM _ NpFB(MN))>7Tfja , W 

where 

F B (j/) = 6(2 J /-l) + G(y), yG I, (5) 

and 

27T 2 CK y 2 

Let yitjv be the law of mjv(<r) = 2Mjv(cr) — 1 under Gibbs measure 7^. Then 

J+ 1 /^(dz)fr(^) exp(-/3jV FB ( i±£)) 

< h(M N ) > Wfj B = - TT — . (6) 

J_i /Xjv(dz)exp(-/?AF B (ip)) 



Benham(1979) investigates the thermodynamics of supercoiled DNA, by 
minimizing free energies, and introduces critical thresholds of supercoiling. 
In this macroscopic approach, it is shown that sufficient negative supercoil- 
ing implies local denaturation. In Benham(1996), this work is extended to 
positively supercoiled DNA (k > 0); looking at the various plots contained 
in this work, we see the appearance of critical superhelical densities above 
which a positively supercoiled DNA remains intact at temperatures higher 
than the melting point. Similarly, we introduce the 

Definition 1 The order parameter of the system is 

< M N > 7r/3 B , or < m N > TftB , 

the magnetization of the spin system. We say that the system exhibits 
phase transitions when there exist critical superhelical densities R C (B) > 
n c (B) such that, in the large N limit, < >n /3 B ^ as k > R C (B), 

< Mjv > 7 r /3 B ^ 1 as n < k c {B), and lim < Mjy > W0 B G (0, 1) otherwise. 

Let Aj\r(A) be the logarithmic moment generating function 

Aat(A) := ln(7T /3a (exp(iVm A rA))), 

with (see e.g. Baxter (1982), p. 34) 

A .. 1 . / \ \ , ,e? a cosh(A) + y/eW° sinh(A) 2 + e" 2 ^ . 
Aoo A := hm -A N A = In ^ % . 

N^oo A e Pa + e Pa 

Then, according to Gaertner-Ellis Theorem, (see e.g. Dembo and Zeitouni(1992)) 
the law of mjv under -Kp a satisfies a large deviation principle with good rate 
function 

I S m S (z) = SUp(Az - Aoo(A)), 
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the Legendre transform of Aoo. = e@ a sinh /\/e 2/3a sinh 2 +e _2 ^ a , and 

^■oo = e/3a cosn /(e 2 ^ a sinh 2 +e~ 2l3a ). is thus strictly convex on R, and 
its Legendre transform is essentially smooth (see Theorem 26.3 in Rock- 
afellar(1972)). The derivative I' siag (z) tends to +00 when z converges to a 
boundary point of the domain of Jsing- From computation, 



r , , . .ze-^ + ^l + z^e-^-l), 
hmgiz) = zln{ 



eP a y/l + z 2 (e-W° _l) +e -^ 

\z\ ^ 1, and isingC^) = +°° when \z\ > 1. 

In the special case where a = 0, the rate function becomes the entropy 

hmg(z) = — ^ ln(l + z) + ln(l - z), \z\ < 1. 

The integrals appearing in © can be estimated through Laplace's method 
by rewritting the numerator heuristically as 

+1 dz/ l (i±^)exp(-iVJ( z )), 
1 z 



where we set 



J(z) := I sing (z) +/3 J p B (i±£), I*] < l. 



Then © is asymptotically equivalent to 

;+l dz/^) exp(-iV( J(z) - inf kKl J(z))) 



/+ 1 dzexp(-iV(J(z) - inf kKl J(z))) 

Theorem 13a unique z* £ (— 1,+1) minimizing J(z), with 

1 + z* 

< Afa > WftB — ► — J— ■ 

The model does not exhibit phase transitions in the sense of Definition 
^ This is a consequence of the stiffness of the rate function I s i ng (z) as 
\z\ — > 1: hm| z |^ 1 (d/dz)I s i n g(z) = +00, and the rate function J(z) can not 
be decreasing in the neighborhood of z = 1. The minima of J(z) are thus 
located in the interior of the unit interval. Suppose that the Ising measure 
is replaced by some probability measure Vn on Qjy, such that the law of 
Mjv under Vn satisfies a large deviation principle with strictly convex and 
smooth free energy function A; then its Legendre transform is essentially 
smooth, and again, using Varadhan's Theorem, one gets the rate function 
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J(z) — ini z J(z); similarly, the minima of J are located in the interior of the 
domain of the Legendre transform, and no phase transition occurs. 



Proof: Consider the probability measure 

f c hn(&z) eM-N(3F B ((l + z)/2) 



vn(C) 



j\ m (dz) exp(-N(3F B ((l + z)/2) ' 



for any Borel subset C of [—1,1]. Varadhan's Theorem (see Deuschel and 
Stroock, Theorem 2.1.10 and exercice 2.1.24) gives that un satisfies a large 
deviation principle with good rate function J(z). If J attains its infimum at 
a unique point of the interval, the sequence vjq converges weakly to the 
point mass 5 Zt . Consider first Fb{v), y € [0, 1], or equivalently Fs+2bKA+b, 
which is equal to 

2(bA 2 + 7T 2 C) (y-y )(y-yi) 
A 2 (y-y 2 ) ' 

where 

tt 2 CA fil t r , , 4ir 2 C 
y = -.A >0, yi = - Ko{bA2 + 7r2c) m + Kqk) and y 2 = - — . 

Notice that b > implies that y\ > y 2 . The function has a pole at y = y 2 < 
0, and two roots y\ and yo > 0. Then Fb is strictly convex on the half line 
(?/2,+oo). Thus J is strictly convex on [—1,1], with \\m\ z \^i J' (z) = +oo. 
The unique infimum is thus located in the interior of the interval. 



2.1.2 Limited perimeter 

Computations done in some theoretical studies (see e.g. Benham(1989), 
p. 268 and Benham(1990), p. 6302) or empirical studies( see e.g. Sun et 
altri(\995, p. 8658)) deal with the behavior of Mjy when 2r is fixed, or is 
small. In what follows, we give conditions on the growth of r = rjy ensuring 
the possibility of phase transitions, that is the possibility for the existence of 
a stable and robust denatured state when the superhelical density is small 
enough. 

In what follows, we condition on the event {a; \a\ = 2rjy}, where for any 
configuration a, \a\ denotes the perimeter of a, with \a\ = \i; Oi = — cr i+1 1 = 
2r, and H s - mg (a) = 2\a\ — N. Classical combinatorics (see e.g. Feller(1971)) 
shows that the number of configurations of length N with Y2iLi n i = n an< ^ 
perimeter 1 ^ r ^ N/2 is given by 
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Let U rj N be the uniform probability measure on the subset C r> jv of the cube 
£In consisting of spins of perimeter 2r. Let P r> N be the law of Mjv under 
U r>N , with supp(P rj Ar) = {r/N, r/N}, given by 

In— 1\ /N—n—l\ 
Z^r^n^N-r \r—l) \ r—1 ) 

Then the average < h(M) > n/3 B becomes 

£ r exp(-&r)|{H = 2r}| Er^N-r P r,N (n/N)h(n/N) exp(-iV/3F B (n/iV)) 
£ r exp(-p a r)\{\a\ = 2r}\ Er^N-r P r , N (n/N) exp(-N(3F B (n/N)) 

We will be concerned with integrals of the form 

C P rmN (dy)h(y) e W (-(3NF B (y)), 
Jo 

when r = rjy is such that r^/N — * as N — > oo. 

Lemma 1 Assume that 2r^ < N and that rjy /N — ► as N — > oo. Then 

when n = [pN], for < p ^ 1. The sequence of probability measures 
{Pr N ,N)n&N satisfies a large deviation principle with good rate function T r : 
R — ► [0, +oo) given by I r (y) = 0, y G I , and T r (y) = +oo, y G I c . 

Proof The first assertion is a consequence of Stirling's formula. Assume 
that A = (a, b) C I. For JV large enough, A D {r N /N, ■ ■ ■ , 1 - r N /N} ± 0, 
and A contains an element pn of the form p^ = [pN]/N for some < p < 1, 
with 

logP rjViiV ({/0Jv}) ^ logP rNjA r(A) ^ 0. 
Using and (JHJ), it remains to check that 

e (;)r;"))^o. 

r^n^N-r v 7 v 7 

But, as the sum larger than one, 

»4m e Of;")) 

<Ilo g ((iV-2r, sup (")("-")) 
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log(iV) I . fn\fN-n 

The sequence (( n )( ~ n )) n attains its supremum when n = [iV/2], and the 
statement is a consequence of (jHJ). A is a /^-continuity set. When Adl = 0, 
P rjV) jv(-A) = 0, which is consistent with / r (y) = +00 when y ^ I. When 
A takes the form A = [— c, e] with c > and £ > 0, 3 iVo £ N such 
that < r^/N < e, ViV ^ iVo, and the same argument applies. When 
A = [— c, 0], P rjVj Ar(A) = 0, and we have the inequalities defining the large 
deviation principle 

- inf r(y)<limiiifllog(P P ^(A)) 
y&A iv 

and 

hmsup 1 log(P rjViJV (A)) < - inf (9) 

The above arguments show that the upper and lower bounds © hold for 
open and compact sets. The sequence of measures {P Tn ,n) is supported 
by the unit interval, and the sequence is exponentially tight. The large 
deviation principle follows. 

When the large deviation principle is satisfied with the flat rate function 
I r , Laplace's method gives that the system exhibits a phase transition with 
respect to the order parameter M^: the mass of the integral is located near 
the infimum of the function Fe(y), and therefore, the DNA is completely 
denatured when the parameters of the problem are such that 

inf F B {y) =F B (1). 

2/6(0,1) 

This occurs for example when the helicity density k is smaller than a critical 
density k c . 

Theorem 2 The function Fb '■ I — > R attains its infimum at a unique 
point y*(B, k). Let 

y = -kA > 0, yi = - — j— (4bA + K k) and y 2 = — . 

Let R C (B) and k c (B) be the smallest roots of the polynomials P{n) = (yo — 
V2){y\ ~ IJ2) - y\ and Q(k) = P {k) - (1 - 2y 2 ), with R C (B) > k c (B). Set 
M* = 2/2 + v (2/2 — 2/0X2/2 — yi)- Then i) k > R C (B) implies that Fb is 
increasing on I with y*{B,n) = 0, it) k c (B) < k < R C {B) implies that 
y*(B,K) = M* G (0,1), with Fb decreasing below y*(B,n) and increasing 
above y*(B,n), and iii) k < k c (B) implies that Fb is decreasing on I with 
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y#(B,K) = 1. Let h : I — > R be bounded and continuous. Assume that 
rjsr/N — > as N — > oo. Then 

P rN;N (hexp(-Nf3F B )) 

-> h(y«(B,K)). 



P rN ,N(exp(-NpF B )) 



Proof: We look for the infimum of Fb on I; equivalently, we can consider 
the infimum of Fb + 2bnA + b, which is equal to 

(2bA 2 + 27r 2 C) (y-wKv-Vi) 



A 2 (y - y 2 ) 

Notice that b > implies that y± > y 2 . The function has a pole at y = 
y 2 < 0, and two roots y± and yo > 0. We see that the first part of the 
theorem (cases i), ii) and hi)) is related to the location of the infimum of 
the function with respect to (y 2 , 0), (0, 1) and (1, oo). Taking the derivative, 
we must look for the roots of y 2 — 2y 2 y + y 2 (yo + Hi) — UoUl, of discriminant 
(yo ~ y 2 ){yi - y 2 ) ^ 0, since yi > y 2 , y > and y 2 < 0, and we get 
the condition for the largest root M*. The polynomial P(k) is obtained 
by imposing M* < 0, and Q(n) by imposing M* < 1. Concerning the last 
assertion, consider the probability measure 

(A) f A eM-N/3F B (y))P rN ,N(dy) 
m{ h / je xp(-JV^B(y))Pr w> iv(dy)' 

for any Borel subset A. From Varadhan's Theorem (see e.g. Theorem 
2.1.10 and exercice 2.1.24 in Deuschel and Stroock(1989) or Theorem II. 7. 2 
in Ellis(1985)) and Lemma ^ the sequence (/ijv) satisfies a large deviation 
principle with rate function Ip{y)— inf y Ip(y), where If{v) = I r ' (y)+(3FB(y)- 
inf yg R, If{v) = ft infygi Fsiy), which is realized at a unique element y*(B, k); 
/xjv converges then weakly to the point mass S ytt m,K)- 



2.2 The heteropolymer Case 

In the heteropolymer case, the field B N = (6«)i^i^Ar is indexed by a word 
of length N on the two letters alphabet {A + T, G + C}. Given B N , let 

1 N 

i=l 

where I(-) is the indicator function. Given some proportion p + £ I, we shall 
consider families (B n )n^ of words with p~^(B N ) — ► p + . Let 

1 N _ 1 N 

m%(a) = -jy^Ubi = b A r)o-i and m N (a) = j^^Kh = bac)^i, 

i=l i=l 
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with to 7v = TO^y + to^t , and consider the mapping ^ b n : f2 at — ► R 2 given 
by ^ B N(a) = (m^(a),m~^(a)), which permits to control the eventual local- 
ization of the magnetization (see Mathieu and Picco(1998) for other use of 
this mapping in random fields Curie Weiss models). Let Vjv be a probability 
measure on O/y , and let Qn be the image measure of Vat under \& b n . We 
consider the behavior of and mjj under the Gibbs measure 

_ V N (a)eM-NPF BN (a)) 
^,BN,v N (<r) - z N (f3,B",V N ) ' 

where we set 

I N 

F BN ( a ) = G (M N (a)) + - Y, b ^- 

i=l 

Let us denote by B GG and Bat the fields associated with the GC and 
AT homopolymers, with critical superhelical densities R C (B GG ), k c(B gg ), 
k c (Bat) and limiting proportion of broken bonds y*(B GG , K ) ( see Theorem 

El). 

Theorem 3 Let (B n )n & n be a sequence of words of {bAT,b GG } N with 
p^(B N ) — ► p + , for some p + € I. Suppose that the family of probabil- 
ity measures Qn satisfies a large deviation principle with rate function Id '■ 
R 2 — > [0, +00) given by Id(v) = 0, y € D and ir>(y) = +oo ; y G D c , where 
D = [— p + , p + ] x [— p~,p~~]. Let k be such that k c (Bgc) < k < k c (Bat)- 
Assume that p + > y*(B GG , Then denaturation is localized on the AT do- 
main, that is < mtr > w — ► p + , and < m7 T > w — ► — p~ , where 

' jv "g,B N ,v N r ' -<V f3,B N ,V N r ' 

we set p~ = 1 — p + , the proportion of GC bonds. Conversely, assume that 
p + < y*(B GC ,K): Then < >n^ B N VN — > P + , and < m N >-!t bN vn — > 
2y*(B GG , k) - 1 - p + > -p". 

Remark 1 If the parameters of the problem are such that k c {Bqq) < k c (Bat) 
(recall that the fields are temperature dependent, see $]))), and k is so that 
R c (Bgc) < K < k c (Bat), denaturation is localized on the AT domain, for 
arbitrary proportion p + of AT bonds. However, when k < R C (B GG ), denat- 
uration is localized on the AT domain when p + > y*(B GG , K ) an d denatura- 
tion expands beyond the AT domain when p + < y*(B GG , K )- 

Proof: We must evaluate the asymptotic behavior of 

Z a VN(o-)exp(-NpF B M(o-))h(m+(a),m-(o-)) 
E CT Viv(a)exp(-iV/3F B iv(a)) 

for functions h of the two variables (m^,m^). Notice that 

F BN (a) = G{M N {<j))+b AT m+{a) + b GC m N {(j) 

= G(M N (a)) + b GG m N (a) + (b AT - b GG )m] <[ {a), 
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Let F(m + ,m ) = G{M) + b A Tm + + b GG m > where m = m + + m and 
M = (m + l)/2. We must thus check the behavior of the expectation 

fM = /^^)exp(-iV/?F(y))Q j v(dy) 
MJVW fexp(-N/3F(y))Q N (dy) ' 

for bounded and continuous functions h. From Varadhan's Theorem, the 
family of probability measures pn satisfies a large deviation principle with 
rate function I F (y) — ia£ y I F (y), where I F (y) = I D (y) + /3F(y). M y I F (y) = 
[3'va.fy^D F{y). If this infimum is realized at a unique point y* of D, the 
sequence of measures p at converges weakly to the Dirac mass 5 Vt . We thus 
look for the minima of F on D. 

inf F(m ,m~) 

(m+,m~)G-D 

= inf (G(M) + b GG m + (b A T - b GG ) sup m + ) 

m l^l (m+ ,m~): m~+m+=m 

Given, p + , consider m such that M = (m + l)/2 ^ p + , that is m ^ 2/9 + — 1. 
Then sup m + +m - =m m + = /9 + (corresponding to (fTU|) below)): when m + = 
p + , one obtains m~ = m — p + and the pair (m + , m~) is element of -D since 
m - = m — p + ^ p~ if and only if m ^ /9 + + p~ = 1 and m~ ^ — if and 
only if m ^ p + — p~ = 2p + — 1 Similarly, when m is such that M ^ p + , 
the maximal possible value of m + is m + = 2M — p + (corresponding to 
Qlljl below). In summary the infimum is obtained by taking the minimum 
between 

inf G(M)+b GC m + (b AT -b GC )p + , (10) 

and 

inf G{M) + b AT m + p-(b AT -b GC ). (11) 

M<p+ 

We next use the hypotheses. From Theorem [21 k < k c (B af ) implies that 
G(M)+b A T r n attains its infimum when m* = 1, or M* = 1, and is decreasing 
on the unit interval; Qlljl becomes 

G{p + ) + b AT (2p + - 1) + (1 - p + )(b AT - b GC ) = G(p + ) + p + b AT - pb GG . 

Concerning (|10|) . k > k c (B gg ) and, from Theorem|2l the function G(M) + 
b GG m attains its minimum at y*(B GG , k), and is increasing above this point. 
Thus 

inf G(M) + b GC m = G{p + ) + (2p + - l)b GC , 

when p + ^ V*(Bqc, k). Then, both Ijlfljl and 1)11(1 are minimized for M = p + , 
which is the maximal value of m + . Thus m + = p + and m~ = m — m + = 
2p + — 1 — p + = —p~, as required. 
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Conversely, assume that p + < 1/*(Bqc, k). Then (|rU|) becomes 
inf G(M) + b GC m + (b AT - b GC )p + 

= G(y*(B GC , «)) + b GC (2y*(B GC , «) - 1) + (6 AT - 6 G c)p + - 
When M < p + , the infimum is still given by 

+ b ATP + - p-b GC = G{p + ) + b GG (2p + - 1) + p+b Ar - p+b GC , 

and one obtains that the minimum is realized when M = y*(B GG , k), and 
therefore m + = p + andm" = m—m + = 2M— 1— m + = 2y*(B GG , k) — 1— p + , 
as required. 

In the remaining, we give an example of probability measure Vn on 0,^ 
such that Qn = Vn ° satisfies a large deviation principle with rate 

function Ijj. Given a sequence (B N ), consider the family of spins a N given 
by = l(bf = 6 at) — = b GG ). Given a sequence rjv, consider the 
restricted Ising measure 

t/ / \ ^ / x i(kl < 2^)^^) 



Lemma 2 Let (.B^) ngn &e a sequence of words such that p~j^ — > p + G I, 
and ct^ G Cf Ni N> for some sequence (riv)jveN- Assume that fjv + 2 ^ r^v ; 
and i/tai r^/N — ► as iV — > oo. T/ien i/te probability measure Qn = 
TTf3 a7 r N o ^ B lf satisfies a large deviation principle with rate function I p. 

This is an extension of the homopolymer case: fj^/N — * means that the 
word associated with the DNA is formed of relatively large droplets of A+T 
bonds alternating with similar droplets of G+C bonds. 

Proof: When r N /N -> 0, © implies that log(V N (a N ))/N -> for any 
sequence of spins (& n )n with |er^| 2rjv- Now, given an open subset A 
of 79°, containing some point A = (A + ,A _ ), with |A + | < p + and |A _ | < p~, 
consider the sequence Xn = ([NX + ]/N, [NX~]/N), which is in A for A ^ 
N (A,X). Then 

3d N G Cf N , N with f N < Fjv+1 and V BN (a N ) = X N , VA ^ AT (A, A). (12) 
Suppose that (|T2*|) is true: Then one obtains 

V N (a N ) ^ Q N ({X N }) < Qjv(A) < 1, 



and it follows that log(Qjv(^4))/A — ► as A — ► oo. In this case, A is a Id- 



continuity set. The main property to check is thus (|12[). Set A^ = [A ± A]/A, 
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and define = (A^ + pj^)/2, where p N = 1 — p%- Choose an origin in the 
circular DNA at some site in with cr£f = — 1 and of = 1. We can consider 
the linear string, starting at in, with an A+T droplet, and ending with a 
G+C droplet. Let A±, ■ ■ ■ , y4. fjv be the A+T droplets, ordered according to 
their appearance along the string, and define similarly G\ , • • • , Gf N . Set a,j = 
\Aj\ and gj = \Gj\, 1 ^ j ^ fjy. The string is viewed as the juxtaposition of 
symbols A\G\ ■ ■ ■ A fN Gf N . Let T + and T~ be defined by 

j 

T+ = min{l ^ j ^ f N ; Y, a k > NM+}, 

k=i 

and 

j 

T~ = min{l j ^ f N ; Y,9k> NM N }. 

k=l 

Notice that both T + and T~ are well defined for N large enough since 
E£iafc = Np+, TZi9k = Np N , p+ -+ p+, p N - p~, M+ - (A+ + 
p + )/2 < p + and — ► (A - + p~)/2 < p~ . Let ii be the site situated in 
A T + such that 

|{i € A x U • • • U A T+ ; i^ n}| = AM+, 

and define similarly 12 for the G+C domain. Set of = +1 when i £ Ai U 
• • •UA r + and i ^ i\, of = —1 when i G A T + U- • -UAf N and i> i\, of = — 1 
when i e Gi U ■ • • U G r - , i ^ Z2, and = +1 when i £ G T - U • • • U Gf N , 
i > %2- Clearly 

5f = AM+-(Ap+-AM+) 
= iV(2M+ - p+ ) = AA+ , 

and similarly 

giving ^ b n(o n ) = Aat, as required. Next suppose without loss of generality 
that T + T~, then i\ < ii, and, from construction, of = of when i ^ i\, 
of = —of when % > ii and cr^ = —1 when i\ < % ^ ^- It follows that 
jci^l ^ 2(f7v + 1), as required. 

Now, consider a Borel subset A C D c in R 2 . For any sequence of spins 
(o N ) in O^, -p± < m^c^) < p±, and thus ^ b n(o n ) G A c for A" large 
enough, that is there exists N (A) £ N such that Qn(A) = 0, ViV > N (A). 
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3 Statistical approach 



3.1 A Bayesian model 

In the heteropolymer case, the field B N = (bi) is indexed by a word of length 
N on the two letters alphabet {A + T,G + C}. The DNA can be seen as a 
kind of geometrical code where a word is glued on the double helix. When 
the perimeter is fixed by setting Vjv = U Tj n, C r ,N = {cr G 17^; \a\ = 2r} 
is a nonlinear code, as a subset of Qn, and tools from information theory 
and bayesian statistics are of great utility for computational issues. We will 
see that the Hamiltonian of the system models the a posteriori law on the 
parameter space given the observation B N . First, consider the a priori 
probability measure on the parameter space given by 

. , V N (a) eM-N(3G(M N (a))) UZi^M-^at) + exp(-a i /36 GC ) 
vn{o-) = , 

where Zn(vn) denotes the related partition function and Vjy is an arbitrary 
probability measure onfljv. Notice that takes into account the level of 
superhelicity of the DNA through the superhelical density k, and that z/jv 
corresponds to the Gibbs measure associated with the homopolymer with 
field B, given by 

- bAT + bcc 
b = 2 ' 

that is 

V N (a)exp(-NpF B (a)) 
M(T) = Z M B,V N ) • 

The statistical model is as follows: a code word or a parameter cr° is chosen 
at random with law on supp(V/v), and is sended through a noisy channel 
with output alphabet {A + T,G + C} and memoryless channel statistics 
P(-|<7°) given by 

N N l Oau \ 
P(B N \a°) - T\v(b \a°) - TT e MzdM 

The output distribution of B N is 

q(B N ) = vn(o-)P(B n \cj), 

and the a posteriori distribution on the parameter space is the Gibbs distri- 
bution 

_ V N (a)exp(-NpF BN (a)) 
*f>,B»,v N W ~ z N (P,B",V N ) • 



15 



Notice that 

E q (p+(B N )) = 1 - 9 + (29 - 1) < M N > VN , 

where 9 = exp(— /3&Ar)/(exp(— /3&at) + exp(— 0bac)) ^ 1/2- The law of 
Mjv(c) under un is subject to threshold phenomenon, as shown in Section 
12.11 When Vn = U TN) n with rjy/N — > 0, Theorem |2] gives information 
on the limiting support of vn'- k c(B) < k < R C (B) implies that the law 
of Mjv under converges toward the point mass S y ^<B )K \ with y*(B,K) G 
(0, 1), and the average proportion of A+T bonds in B N under q is given by 
l-9 + (26-l)y*(B,K) G [1-9,9]. 

Having observed some word B N , consider the Bayes estimator for cr° 
under quadratic loss L(a,a') = \ \a — a'\\ 2 . Bayesian Theory (see e.g. Robert 
(1992)) gives that the optimal Bayes estimator under quadratic loss is 

^ _ EaeC rN WN(a)P(B N \a) 

and the estimated magnetization ^A^i/N corresponds to our order param- 
eter < mj\r >7r^ gN v , showing that Benham's model has an interesting 
statistical content. 

In the next Section, we imbeed Benham's model in a bayesian segmen- 
tation model of current use in bioinformatics, and provide new algorithms 
with priors adapted to supercoiled DNA. 

3.2 Bayesian segmentation for strand separation 
3.2.1 A two coins example 

Liu and Lawrence(1999) present a Bayesian model for segmentation of biopoly- 
mers, and provide algorithms for drawing samples from the various posterior 
laws of the model, which might be of great interest in the strand separation 
problem. We start with their two types of coins example, to fix ideas. Sup- 
pose you know that in a coin tossing game of length N, the first A Bernoulli 
have a probability of success 9\ and the next N — A tosses have a probabil- 
ity 92 7^ 9i of getting a head. Let y f, s be the observed sequence, whith h\ 
(resp. /12) heads and t\ (resp. t<z) tails in the first (resp. second) part of the 
sequence. The change point A is treated as a missing data, and has some 
prior law g(a). The likelihood of the observed data is given by 

L(9i,9 2 ; y bs,A = a) = 9^(1 - 9^9^(1 - 9 2 ) t2 g(a). 

In their work, the prior tt(#) for 9 = (61,62) is a product measure associated 
with two independent Beta random variables B(6\; ai,[3\) and B(6 2 ', a 2 , fa)- 
Let p(6\, 6 2 , a, y bs) be the joint law of all variables, with 

p(6i, 6 2 , a, y obs ) = L(6i,9 2 ; y bs,A = a)ir(6)g(a), 
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and 

P(A = a,y obs ) = J J p(9 1 ,9 2 ,a,y obs )d8 1 d9 2 . 

The following algorithm will converge and give samples from the posterior 
law of (0 1 ,e 2 ,A): 

• Fix A = a and 9 2 , and draw 9\ from its conditional posterior law 
P(8 1 \8 2 ,A = a,y obs ), to get the new 6>i, 

• proceed similarly for 9 2 to get the new 6 2 , 

• draw A from its conditional posterior law P(A = a\ 6, y bs) propor- 
tional to U i =i 2 °i l{a) ( 1 ~ 0i) tl{a) g(a), where hi(a) and t^a) are the 
number of heads and tails contained in the i-th part of the sequence, 
i = 1,2. 

Coming back to the setting of Section IXTl choose Vm to be t/i,jv, fixing 
the perimeter to 2r = 2. Benham's model deals with a circular DNA, and 
they are N configurations of length a, and two change points. Forgetting 
for a while this slight difference, we see that the Bayesian model of Liu 
and Lawrence can be adapted for Benham's model: a is the missing data 
or segmentation parameter A, and the prior law g(a) is just the a priori 
measure vjq of Section |3~T1 Suppose that Oi = +1, 1 ^ i ^ A and cij = — 1, 
A < i ^ N. Set B N = y obs and fix the parameter 9 to 

£ exp(-/?6^ T ) - = exp(/3b A T) =\-Q 

exp(-/3&AT) + exp(-/36 GC ) ' exp(/36 j4 T) + exp(/36 G c) 

(13) 

The model of Section ^, ll can be seen as a particular case of the segmentation 
model, and the a posteriori law of the change point A is the Gibbs measure 
ttq b n v«- F° r the deterministic model with constant 8, the last step of the 
algorithm corresponds to sampling with ftp b n Vu- an d Benham(1999) 
give algorithms for strand separation using transfer matrices from statis- 
tical mechanics; the method can be applied, thanks to the quadratic form 
appearing in the prior (the transfer matrices have complex entries, a con- 
sequence of the gaussian transform). The perimeter is not limited, and is 
penalized by the exponential weight appearing in the one dimensional Ising 
Boltzman weight exp(— 4r/3) for a perimeter of 2r (notice that this way of 
penalizing too large perimeters was implemented in a recent work of Ramen- 
sky et afe"i(2000,2001)). Their method is applicable to situations where the 
free energies for strand separation b A T and bcc fluctuate: in real situations, 
chemical reactions can alter these free energies. In the randomized case, the 
Bayesian segmentation model can also provide an interesting alternative for 
studying the strand separation problem with random free energies b A T and 
bee- 
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3.2.2 The general case 

We adapt the setting of Liu and Lawrence(1999) and Ramensky et al- 
iri(2000,2001), and indicate the main features of the model. The number of 
domains or segments of the circular DNA can be limited to r max . The miss- 
ing data is the spin a, with prior law un, given by Vn and the homopolymer 
of field b = (pAT + &gc)/2, with free energies bAT and bcc given by formula 
a can be seen as a juxtaposition of droplets of ± spins arranged around 
the discrete circle of length N . 

Suppose they are 2r domains with r positive droplets, that is containing 
sites i with <7j = +1, and r negative droplets. The parameter 9 is here 
defined given a. We associate to every positive droplet with <Jj _i = — 1, 
<jj = +1, io ^ i ^ ik, and <7j fe +i = —1 a family of k+1 i.i.d. Bernoulli £j,with 

p( £t = A + T)= 9 io ' ik and P( Bi = G + C) = 1 - 6 io ' ik , 

of random parameter which is chosen according to some prior law / + . 

Do the same for negative droplets for a random parameter Q3°J l of prior / _ , 
with 

P{sj = A + T) = 1 - 9 jo ' jl and P(ej = G + C) = 6 jo ' jl . 
The requirements might be 

E f+ (6 io ' tk ) =E r (9 jodl ) =6~i, 

where the constants bAT and bcc appearing in Q and (fl3|) can be taken as 
average values. Positive droplets force the sample toward A + T outcomes 
and conversely for negative droplets. The priors can be chosen as in Liu 
and Lawrence(1999) as Beta laws with the required expectations. 

A second way of randomizing the parameters consists in taking, indepen- 
dently for each segment, two positive random variables 6at(<^) and bcci 1 ^), 
distributed according to some law, which might be motivated from thermo- 
dynamics or biochemistry, with average values given by formula Draw 
2r i.i.d. realizations of these two random variables, and set, for each seg- 
ment, 

g , u \ = exp(-/3fr AT (o;)) 

exp(-/3&Ar(w)) + exp(-/3& G cM). 
The probability to get A+T is 0{lo) when the droplet is positive and 1 — 9{oS) 
when the droplet is negative. 

Let ir(9\a) be the prior law given by the above construction, with joint 
law it (6, a) = 7r(9\a)u]\f(a). Then the law of B N is 

P(B N ) = Y, M°)P(B N \a) 




18 



where P(B \0, a) is the product measure associated with the Bernoulli. The 
posterior law of interest in the strand separation problem is just P(a\B N ) = 
P{B N \a)/P(B N ). This last law is define on the cube n N , of size 2^. The 
observables of interest are the number of denatured bonds Mj^(a) given by 
Tnjf(o) = 2Mjsf(a) — 1, and the restricted magnetization and m^. Infor- 
mation on the localization of denaturation can be obtained by considering 
the proportion of broken A+T and G+C bonds 

± _ m% + p% 
m n ~ 2 ' 

(see Section I2.2j) . Bayesian segmentation algorithms like backward sam- 
pling, as given in Liu and Lawrence(1999) or Schmidler et altri(2000), can 
thus be applied to the strand separation problem to study local denatura- 
tion as function of the various parameters, taking into account the random 
fluctuations of the free energies needed to denature A+T and G+C bonds. 
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